Mel-Spec to WAV GAN + MIDI LSTM¶
Team Roster:
- Matteo Perona (PID: A16931052)
- Adrian Ye (PID: A17018187)
- So Hirota (PID: XXXXXXXX)
Overview of Generation Tasks¶
In this project, we tackle two distinct but complementary generative modeling tasks in the audio and symbolic music domains.
Task 1 focuses on neural vocoding: generating high-quality audio waveforms from mel spectrograms. We employ a GAN-based architecture with advanced convolutional and residual blocks, leveraging multi-resolution spectral and perceptual losses to synthesize realistic audio from compressed spectral representations (conditional, continuous).
Task 2 addresses symbolic music generation, where the goal is to generate a drum sequence conditioned on a given instrument track. Using the GigaMIDI dataset, we preprocess and tokenize MIDI files to create paired instrument and drum sequences, and train an LSTM-based model to capture the temporal and structural dependencies in music. Together, these tasks explore both the waveform and symbolic levels of music generation, demonstrating the application of deep generative models to complex, real-world audio and music problems.
Task 1¶
Exploratory Analysis, Data Collection, Pre-Processing, Discussion¶
Context:¶
Where does the dataset come from? What is it for, how was it collected, etc?
The dataset is the MTG-Jamendo dataset, an open dataset intended for music auto-tagging. The music was gathered from the Jamendo website, uploaded by contributors under Creative Commons licenses, with the collection methodology and tagging information available in the public repository.
The Jamendo dataset comes pre-packaged into subsets: raw_30s audio files, raw_30s low-quality audio, raw_30s mel spectrograms, and the corresponding autotagging-moodtheme subsets for audio, low-quality audio, and mel spectrograms. The autotagging subset also comes with a TSV containing the tags for each file.
For our purposes, we used the autotagging-moodtheme/audio-low subset.
Discussion:¶
Report how we processed the data (or how it was already processed)
To process the data, we used a custom script that split each audio file into fixed-length chunks (zero-padding the last chunk if it was too short). The chunk length is defined in our config, so we can, for example, split files into 5 s or 15 s chunks.
We then used a separate script to generate a mel spectrogram for every chunk, which is what we feed to the model.
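The split-and-pad step can be sketched as follows. This is a minimal dependency-free illustration of the chunking logic, not our actual script; the chunk length (e.g. 5 s or 15 s) comes from the config.

```python
def chunk_samples(samples, sr, chunk_seconds):
    """Split a waveform into fixed-length chunks, zero-padding the last one."""
    n = sr * chunk_seconds  # samples per chunk
    chunks = []
    for start in range(0, len(samples), n):
        chunk = list(samples[start:start + n])
        if len(chunk) < n:
            chunk.extend([0.0] * (n - len(chunk)))  # pad the short final chunk
        chunks.append(chunk)
    return chunks
```

Each chunk then gets its own mel spectrogram in the second script.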
Code:¶
Support our analysis with tables, plots, statistics, etc
Original Audio Files¶
import os
from IPython.display import Audio
# Path to the directory containing MP3 files
audio_dir = 'data/raw/audio'
# List all files in the directory
all_files = os.listdir(audio_dir)
# Filter for MP3 files
mp3_files = [f for f in all_files if f.endswith('.mp3')]
# Display the first 3 MP3 files
for i, mp3_file in enumerate(mp3_files[:3]):
    file_path = os.path.join(audio_dir, mp3_file)
    print(f"Displaying: {mp3_file}")
    display(Audio(file_path))
    if i < 2:
        print("-" * 20)  # Separator
Displaying: 1162600.mp3
--------------------
Displaying: 847200.mp3
--------------------
Displaying: 1012000.mp3
Preprocessed Audio and Mels¶
# Paths to the directories containing processed audio and mel spectrograms
clean_audio_dir = 'data/clean/audio'
clean_mels_dir = 'data/clean/mels'
# List files in each directory
clean_audio_files = os.listdir(clean_audio_dir)
clean_mels_files = os.listdir(clean_mels_dir)
# Filter for audio files (assuming they are .wav after processing)
clean_audio_files = [f for f in clean_audio_files if f.endswith('.wav')]
# Assuming mel spectrograms are saved as .npy files
clean_mels_files = [f for f in clean_mels_files if f.endswith('.npy')]
import numpy as np
import matplotlib.pyplot as plt
import librosa.display
def show_melspec(path):
    # Load the .npy mel spectrogram
    mel_spectrogram = np.load(path)
    print(mel_spectrogram.shape)

    # Plot it
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(
        mel_spectrogram,
        sr=12000,        # Our sample rate
        hop_length=256,  # Same hop length used in generation
        x_axis='time',
        y_axis='mel',
        fmin=0,
        fmax=6000,
        cmap='magma'
    )
    plt.title('Mel Spectrogram')
    plt.colorbar(format='%+2.0f dB')
    plt.tight_layout()
    plt.show()
print("--- Displaying 3 Clean Audio Files ---")
# Display the first 3 clean audio files
for i, (audio_file, mel_file) in enumerate(zip(clean_audio_files[3:6], clean_mels_files[3:6])):
    file_path = os.path.join(clean_audio_dir, audio_file)
    print(f"Displaying: {audio_file}")
    display(Audio(file_path))
    show_melspec(os.path.join(clean_mels_dir, mel_file))
    print("-" * 20)  # Separator
--- Displaying 3 Clean Audio Files ---
Displaying: 13400_chunk008.wav
(96, 704)
--------------------
Displaying: 12100_chunk005.wav
(96, 704)
--------------------
Displaying: 47400_chunk002.wav
(96, 704)
--------------------
Modeling¶
Context:¶
How did we formulate the task as an ML problem? E.g. what are the inputs, outputs, what is being optimized? What models are appropriate for the task?
We originally wanted to build an autoencoder to generate unconditioned music: the model would output mel spectrograms, which we would convert to audio with an existing algorithm. We quickly realized that existing mel-to-audio methods were not great, so reconstructing audio from mel spectrograms became our objective.
That made the formulation straightforward: the input is a mel spectrogram, the output is a WAV file that is as close as possible to the original, and the quantity we optimize is similarity to the original, since the goal is to reconstruct the information lost in the mel representation.
We were inspired to use GANs because the most successful implementations of this idea, BigVGAN and MelGAN, are GAN-based.
Our architecture is inspired by BigVGAN's, though we modified the pipeline to use MRF blocks instead of BigVGAN's AMPBlocks.
Discussion:¶
Discuss the advantages and disadvantages of different modeling approaches (complexity, efficiency, challenges in implementation, etc.)
We know that transformers or diffusion models could have worked, but their complexity and training time were not in our best interest.
We had some issues early on with training:
- The model had few parameters, which we could increase.
- The batch size was small, which we could also increase.
- The generator failed to produce some frequencies. We solved this with skip connections, re-injecting the original mel spectrogram at multiple points in the network to preserve the base frequencies.
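As a hedged sketch (not our exact code), the skip connections and mel re-injection look roughly like this: each upsampling stage adds its residual-block outputs back to their inputs, and a 1x1 convolution projects the conditioning mel spectrogram to the stage's channel width so it can be added back in at every resolution. All layer sizes and dilations here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Dilated convolutions whose outputs are added back to the input (skip connections)."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual skip connection
        return x

class UpsampleStage(nn.Module):
    """Upsample, run a residual block, then re-inject the conditioning mel."""
    def __init__(self, in_ch, out_ch, stride, n_mels=96):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, stride * 2,
                                     stride=stride, padding=stride // 2)
        self.block = ResBlock(out_ch)
        self.mel_proj = nn.Conv1d(n_mels, out_ch, 1)  # 1x1 projection of the mel

    def forward(self, x, mel):
        h = self.block(self.up(x))
        # Stretch the mel to this stage's time resolution and add it back in,
        # so the base frequencies are never lost deeper in the generator.
        mel_up = F.interpolate(mel, size=h.shape[-1])
        return h + self.mel_proj(mel_up)
```

Stacking several such stages doubles (or quadruples) the time resolution step by step until the mel frame rate reaches the audio sample rate.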
Code:¶
Walk through your code, explaining architectural choices and any implementation details
Evaluation¶
Context:¶
How should your task be evaluated? What should be the properties of a “good” output? What is the relationship between the objective being optimized by your model (e.g. perplexity) and musical properties (e.g. does it follow harmonic “rules”?) versus subjective properties?
Our task should be evaluated on how similar the generated WAV is to the original audio.
Objective metrics included STFT spectrogram similarity and mel spectrogram similarity between the original and the generation; waveform loss and feature loss also contributed, but with small weights.
Subjectively, a good output should sound clear and close to the original.
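To make the spectral objective concrete, a multi-resolution STFT loss can be sketched as below. This is an illustrative NumPy version; the FFT sizes and hop lengths are example values, not necessarily the ones we trained with.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via framed FFT (NumPy only)."""
    x = np.asarray(x)
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_res_stft_loss(y_hat, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average L1 distance between log-magnitude STFTs at several resolutions."""
    loss = 0.0
    for n_fft, hop in resolutions:
        m_hat = stft_mag(y_hat, n_fft, hop)
        m = stft_mag(y, n_fft, hop)
        loss += np.mean(np.abs(np.log(m_hat + 1e-7) - np.log(m + 1e-7)))
    return loss / len(resolutions)
```

Comparing at several window sizes penalizes errors in both fine temporal detail and broad spectral shape.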
Discussion:¶
What are some baselines (trivial or otherwise) for your task? How do you demonstrate that your method is better than these trivial methods?
Our baseline converts the mel spectrogram directly to a WAV file with minimal processing. Because mel spectrogram conversion is lossy, the output is not great.
With our model, we hoped to regenerate the original audio using only the mel spectrogram, so the baseline is that direct mel-to-WAV conversion.
Comparing the baseline and our model's output subjectively, we can say with confidence that our model performed better. Unfortunately, we could not figure out how to make its generations clearer, which is the main issue with the model at the moment; if we can find a way to remove some of this noise, we think the model would do a lot better.
We think hyperparameter tuning would be the main way to reduce the noise.
Code:¶
Walk through the implementation of your evaluation protocol, and support your evaluation with tables, plots, statistics, etc.
import os
import soundfile as sf
import librosa
def wav_to_mel_to_wav_and_display(wav_path, sr=12000, n_fft=2048, hop_length=256, n_mels=96):
    print(f"Processing: {os.path.basename(wav_path)}")
    # Load the original waveform
    y_original, sr_original = librosa.load(wav_path, sr=sr)
    # Convert waveform to mel spectrogram
    mel_spectrogram = librosa.feature.melspectrogram(
        y=y_original, sr=sr_original, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Invert the mel filterbank via its pseudo-inverse to recover a linear spectrogram
    mel_basis = librosa.filters.mel(sr=sr_original, n_fft=n_fft, n_mels=n_mels)
    linear_spectrogram = np.dot(np.linalg.pinv(mel_basis), mel_spectrogram)
    # Convert the linear spectrogram back to a waveform using Griffin-Lim
    y_reconstructed = librosa.griffinlim(linear_spectrogram, hop_length=hop_length, n_fft=n_fft)
    display(Audio(y_reconstructed, rate=sr_original))
# Define the directory containing the generated and original audio pairs
output_dir = 'data/outputs_and_original/'
# List all files in the directory
all_files = os.listdir(output_dir)
# Filter for audio files (assuming they are .wav after processing)
audio_files = [f for f in all_files if f.endswith('.wav')]
# Separate files into generated and original based on naming convention
generated_files = sorted([f for f in audio_files if f.startswith('generated_')])
original_files = sorted([f for f in audio_files if f.startswith('real_')])
# Ensure we have pairs to display
min_files = min(len(generated_files), len(original_files))
if min_files == 0:
    print(f"No audio pairs found in '{output_dir}'.")
else:
    print(f"Displaying up to {min_files} generated and original audio pairs from '{output_dir}'.")
    for i in range(min_files):
        generated_file_path = os.path.join(output_dir, generated_files[i])
        original_file_path = os.path.join(output_dir, original_files[i])
        print(f"\n--- Pair {i+1} ---")
        print(f"Generated: {generated_files[i]}")
        display(Audio(generated_file_path))
        print(f"Original: {original_files[i]}")
        display(Audio(original_file_path))
        print("baseline:")
        wav_to_mel_to_wav_and_display(original_file_path)
        print("-" * 30)
Displaying up to 3 generated and original audio pairs from 'data/outputs_and_original/'.

--- Pair 1 ---
Generated: generated_audio_1.wav
Original: real_audio_1.wav
baseline: Processing: real_audio_1.wav
------------------------------

--- Pair 2 ---
Generated: generated_audio_2.wav
Original: real_audio_2.wav
baseline: Processing: real_audio_2.wav
------------------------------

--- Pair 3 ---
Generated: generated_audio_3.wav
Original: real_audio_3.wav
baseline: Processing: real_audio_3.wav
------------------------------
Task 2¶
THE TASK: Generate a drum sequence, conditioned on an instrument track
Exploratory Analysis, Data Collection, Pre-Processing, Discussion¶
Context:¶
Where does the dataset come from? What is it for, how was it collected, etc? Our dataset came from the GigaMIDI dataset, which features 1.43M MIDI files (2.1M on Hugging Face). The data was gathered from publicly available and user-contributed sources, listed in the "Data Source Links for the GigaMIDI Dataset" PDF.
The majority of the dataset comes from The Drum Percussion MIDI Archive, with 800,000 files, and the MetaMIDI Dataset, with 433,527 files.
We downloaded the MIDI files directly from Hugging Face.
Discussion:¶
Report how we processed the data (or how it was already processed)
The MIDI dataset comes in three splits: training, validation, and testing. Validation and testing each hold 10% of the total dataset, and training holds 80%. No preprocessing has been done to the dataset; however, it is divided into three distinct types of MIDIs: drums only, no drums, and drums + instruments.
First, we split each MIDI into time sequences that roughly correspond to 512 tokens (more on this in a bit). This removes the overhead of slicing sequences to the appropriate length at train time.
Miditok provides a convenient utility function for this. The step took about 10 hours, and we only ran it on the validation and test sets because we knew we did not have enough compute to train on the full 1M+ file dataset.
Since our task is to generate a drum sequence conditioned on an instrument sequence, we only use the drums + instruments subset of GigaMIDI. To create the (drum sequence, instrument sequence) pairs that form our training set, we needed to split the all-instruments-with-drums MIDIs into single-track MIDIs.
Second step: split multi-program tracks into single-track songs. We save the drum sequence, and keep only the non-drum tracks that would be useful for drum-beat generation. These are the following programs:
Acoustic Grand Piano, Acoustic Guitar (steel), Electric Guitar (jazz), Electric Guitar (clean), Electric Guitar (muted), Overdriven Guitar, Distortion Guitar, Acoustic Bass, Electric Bass (finger), Electric Bass (pick) + drums
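In code, that whitelist is just a set of General MIDI program numbers (0-indexed, per the GM standard); the helper below is a sketch of the filtering logic rather than our exact implementation.

```python
# General MIDI program numbers (0-indexed) for the instrument tracks we keep
KEEP_PROGRAMS = {
    0,   # Acoustic Grand Piano
    25,  # Acoustic Guitar (steel)
    26,  # Electric Guitar (jazz)
    27,  # Electric Guitar (clean)
    28,  # Electric Guitar (muted)
    29,  # Overdriven Guitar
    30,  # Distortion Guitar
    32,  # Acoustic Bass
    33,  # Electric Bass (finger)
    34,  # Electric Bass (pick)
}

def keep_track(program, is_drum):
    """Drum tracks are always kept; other tracks only if their program is whitelisted."""
    return is_drum or program in KEEP_PROGRAMS
```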
At this point, dataset size became a consideration. Step one had already blown up the number of MIDI files to about 5 million (up from roughly 1 million in GigaMIDI), and splitting the tracks would likely quadruple the file count again, so we decided to perform this step only on the validation set. That gave us a more manageable 1.3 million files.
Lastly, we filter for MIDIs that have a 4/4 time signature. This simplifies the problem our model needs to learn; in addition, variable time signatures would increase our tokenizer's vocabulary size, so we limited the dataset.
At this point, we build a JSON file formatted like: {base_midi_file: {"drum": drum_midi_file, "others": [inst_midi_1, inst_midi_2, …]}}
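A minimal sketch of building that index, assuming a hypothetical <base>_drums.mid / <base>_<track>.mid naming scheme (our real filenames may differ):

```python
from collections import defaultdict
from pathlib import Path

def build_pair_index(split_dir):
    """Group single-track MIDIs by base song, separating the drum track.

    Assumes filenames like <base>_drums.mid / <base>_<track>.mid; the
    actual naming scheme of the split files may differ.
    """
    index = defaultdict(lambda: {"drum": None, "others": []})
    for path in sorted(Path(split_dir).glob("*.mid")):
        base, _, track = path.stem.rpartition("_")
        if track == "drums":
            index[base]["drum"] = str(path)
        else:
            index[base]["others"].append(str(path))
    # keep only songs that have both a drum track and at least one instrument
    return {b: v for b, v in index.items() if v["drum"] and v["others"]}
```

The resulting dict can then be written out with json.dump for use at training time.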
These are the preprocessing steps we took. After this, we used the dataset to train miditok's REMI tokenizer and saved it for later use during training. We do not pre-tokenize the MIDIs, as we did not think this would significantly increase training speed (file I/O would be the bottleneck for this training run).
We considered a different tokenizer, such as MuMIDI, which is designed for multitrack MIDI songs, but it outputs a nested list as the token sequence, which makes things slightly more complicated. We therefore settled on the REMI tokenizer.
MidiTok ships a MIDI Dataset class that is convenient for PyTorch, but since we were using an unconventional structure built around the JSON file above and were unfamiliar with how that class would behave, we implemented our own DrumConditionalDataset in PyTorch. It batches up batch_size songs, tokenizes them and maps the tokens to integer IDs, ensures consistent sequence lengths, and returns a dictionary holding the final data.
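The framing and padding logic inside that Dataset can be sketched as follows; the special-token IDs are hypothetical placeholders, and the real class operates on miditok token sequences rather than raw integer lists.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical special-token IDs

def make_example(inst_ids, drum_ids, max_len):
    """Frame one (instrument, drum) pair: wrap the drum target in BOS/EOS,
    then pad or truncate both sequences to a consistent max_len."""
    x = (list(inst_ids) + [PAD] * max_len)[:max_len]
    y = ([BOS] + list(drum_ids) + [EOS])[:max_len]
    y = y + [PAD] * (max_len - len(y))
    return {"instrument": x, "drum": y}
```

The returned dictionary mirrors what our __getitem__ hands to the DataLoader.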
Code:¶
Support our analysis with tables, plots, statistics, etc
Modeling¶
Context:¶
How did we formulate the task as an ML problem? E.g. what are the inputs, outputs, what is being optimized? What models are appropriate for the task?
Inputs (X): a REMI-tokenized representation of a single "instrument" MIDI track (e.g. piano, guitar, bass). Output (y): the corresponding drum track, tokenized with REMI in exactly the same way (except we never include non-drum programs or pitches). In practice, we prepend a BOS_None (beginning-of-sequence) token and append an EOS_None (end-of-sequence) token so that the model has a clear "start" and "end" signal.
What type of model are we using, and why did we choose this type? Type of model: LSTM (long short-term memory). Why this model? Sequential dependencies: drums have temporal structure (e.g. kick on the downbeat, hi-hat on off-beats, fills every 4 or 8 bars), and LSTMs are designed to capture long-range dependencies in a single token stream: the hidden state at time t carries information about all previous beats.
Fixed-memory capacity: By tuning the hidden dimension, we can control how much musical “history” the model retains. In this project, we only need to remember a few dozen tokens back (bar and position encodings help), so an LSTM’s gating mechanisms are sufficient.
Compute and familiarity: We’ve worked with LSTMs before and understand their training dynamics (e.g. vanishing‐gradient mitigation, teacher forcing). A Transformer‐based model would also work, but it requires more memory/compute, and we don’t yet need full self-attention over hundreds of tokens. Because the dataset (4/4, mono tracks) tends to be relatively short, an LSTM strikes the right balance between capacity and speed.
How is the model trained? We use cross-entropy loss, comparing the model's predicted token distribution at each step against the ground-truth drum sequence. At generation time, each next token is sampled from the softmax of the LSTM's output logits.
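For illustration, sampling the next token from the softmax of the output logits looks like this (a dependency-free sketch, not our training code):

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng=random):
    """Draw one token ID from the categorical distribution softmax(logits)."""
    probs = softmax(logits)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

During training, the same softmax feeds the cross-entropy loss against the ground-truth token.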
Discussion:¶
Discuss the advantages and disadvantages of different modeling approaches (complexity, efficiency, challenges in implementation, etc.)
We will compare transformers vs LSTM vs other simpler modeling techniques.
Markov models (like the homework): easy to code up and fast to train, but low capacity; they struggle to capture long-range dependencies.
LSTM: training can be inefficient, since both training and inference proceed sequentially in time, and memory usage grows with very long sequences. LSTMs can struggle to capture very long dependencies, but work reasonably well up to several hundred tokens, which is enough for this task. The cell and hidden states summarize the previous tokens, providing the information needed for next-token prediction.
Transformer: can capture very long-range dependencies, but has high memory usage due to the attention mechanism. Our hardware is limited, making it difficult to train efficiently. Transformers also tend to be data hungry; with the GigaMIDI dataset it would have been possible to train one, but we were limited by hardware.
Code:¶
Walk through your code, explaining architectural choices and any implementation details
Evaluation¶
Context:¶
How should your task be evaluated? What should be the properties of a “good” output? What is the relationship between the objective being optimized by your model (e.g. perplexity) and musical properties (e.g. does it follow harmonic “rules”?) versus subjective properties?
For a task like ours, subjectivity plays a large part. Does the drum beat follow the beat of the instrument? Do the drums feel like they go with the instrument track? These subjective judgments would be used in tandem with more concrete metrics like perplexity.
Discussion:¶
What are some baselines (trivial or otherwise) for your task? How do you demonstrate that your method is better than these trivial methods?
Baseline: a model that randomly outputs notes at each timestep. This would not produce a coherent beat; we have not implemented this baseline, since we are confident its output would be gibberish. We could use a metric like perplexity to quantify how good or bad the generations are.
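As a concrete example of the kind of metric we mean: perplexity is the exponential of the mean per-token negative log-likelihood. A sketch, using hypothetical per-token NLL values:

```python
import math

def perplexity(token_nlls):
    """exp of the mean NLL: the model's 'effective branching factor' per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A model that spreads probability evenly over 4 candidate drum tokens assigns each true token probability 1/4 (NLL = ln 4), giving perplexity 4; a random baseline over the whole vocabulary would score near the vocabulary size, so a trained model should sit far below it.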
Code:¶
Walk through the implementation of your evaluation protocol, and support your evaluation with tables, plots, statistics, etc.
Discussion of related work¶
How has this dataset (or similar datasets) been used before? This dataset has been used for many symbolic music tasks, including training MIDI-GPT, a MIDI-generation transformer with style-control abilities.
How has prior work approached the same (or similar) tasks? MIDI-GPT was based on GPT-2, a small transformer model released publicly by OpenAI, and achieved state-of-the-art music generation capabilities.
How do your results match or differ from what has been reported in related work? (We can put this section at the beginning if we'd prefer.)